MP968 Experimental Design Workshop

Leighton Pritchard

University of Strathclyde

2025-11-24

Why do we need experimental design?

We should not cause unnecessary suffering

We should always minimise suffering

This may mean not performing an experiment at all. Not all new knowledge or understanding is worth causing suffering to obtain it.

Where there is sufficient justification to perform an experiment, we are ethically obliged to minimise the amount of distress or suffering that is caused, by designing the experiment to achieve this.

Why we need statistics

It may be easy to tell whether an animal is well-treated, or whether an experiment is necessary.

But what is an acceptable (i.e. the least possible) amount of suffering necessary to obtain an informative result?

Challenge

Quiz question

Suppose you are running a necessary and useful experiment with animal subjects, where the use of animals is morally justified. You are comparing a treatment group to a control group. Which of the following choices will cause the least amount of suffering?

  • Use three subjects per group so a standard deviation can be calculated
  • Use just enough subjects to establish that the outcome is likely to be correct
  • Use just enough subjects to be certain that the outcome is correct
  • Use as many subjects as you have available, to avoid wastage

How many individuals?

The appropriate number of subjects

The appropriate number of animal subjects to use in an experiment is always the smallest number that - given reasonable assumptions - will satisfactorily give the correct result to the desired level of certainty.

  • What assumptions are reasonable?
  • What is an appropriate level of certainty?

By convention, the usual level of certainty for a hypothesis test is: “we have an 80% chance of getting the correct true/false answer for the hypothesis being tested”
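This convention feeds directly into sample-size decisions. As a minimal sketch (assuming Python with scipy; the function name and the normal-approximation formula are illustrative, not a prescribed method), the number of subjects per group needed to detect a difference \(\delta\) between two group means, given a within-group standard deviation \(\sigma\), can be approximated as:

```python
from scipy.stats import norm

def n_per_group(delta, sigma, alpha=0.05, power=0.8):
    """Normal-approximation sample size per group for a two-sided,
    two-sample comparison of means (illustrative sketch only)."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the test
    z_power = norm.ppf(power)           # quantile for the desired power
    return 2 * ((z_alpha + z_power) * sigma / delta) ** 2

# e.g. detect a difference of 1 unit when the within-group sd is 1.5
print(n_per_group(delta=1.0, sigma=1.5))  # ≈ 35.3 → round up to 36 per group
```

Rounding up gives roughly 80% power here; dedicated power-analysis tools (e.g. statsmodels' `TTestIndPower`) refine this approximation.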

Design experiments to minimise suffering

Experimental design and statistics are intertwined

Once a research hypothesis has been devised:

  • Experimental design is the process of devising a practical way of answering the question
  • Statistics informs the choices of variables, controls, numbers of individuals and groups, and the appropriate analysis of results

Design your experiment for…

  • your population or subject group (e.g. sex, age, prior history, etc.)
  • your intervention (e.g. drug treatment)
  • your contrast or comparison between groups (e.g. lung capacity, drug concentration, etc.)
  • your outcome (i.e. is there a measurable or clinically relevant effect)

The 2009 NC3Rs systematic survey

The importance of experimental design

“For scientific, ethical and economic reasons, experiments involving animals should be appropriately designed, correctly analysed and transparently reported. This increases the scientific validity of the results, and maximises the knowledge gained from each experiment. A minimum amount of relevant information must be included in scientific publications to ensure that the methods and results of a study can be reviewed, analysed and repeated. Omitting essential information can raise scientific and ethical concerns.” (Kilkenny et al. (2009))

We rely on the reporting of the experiment to know if it was appropriate

Causes for concern 1

“Detailed information was collected from 271 publications, about the objective or hypothesis of the study, the number, sex, age and/or weight of animals used, and experimental and statistical methods. Only 59% of the studies stated the hypothesis or objective of the study and the number and characteristics of the animals used. […] Most of the papers surveyed did not use randomisation (87%) or blinding (86%), to reduce bias in animal selection and outcome assessment. Only 70% of the publications that used statistical methods described their methods and presented the results with a measure of error or variability.” (Kilkenny et al. (2009))

We cannot rely on the literature for good examples of experimental design

Causes for concern 2

No publication explained their choice for the number of animals used

We cannot rely on the verbal authority of ‘published scientists’ or ‘experienced scientists’ for good experimental design

Very strong cause for concern

“Power analysis or other very simple calculations, which are widely used in human clinical trials and are often expected by regulatory authorities in some animal studies, can help to determine an appropriate number of animals to use in an experiment in order to detect a biologically important effect if there is one. This is a scientifically robust and efficient way of determining animal numbers and may ultimately help to prevent animals being used unnecessarily. Many of the studies that did report the number of animals used reported the numbers inconsistently between the methods and results sections. The reason for this is unclear, but this does pose a significant problem when analysing, interpreting and repeating the results.” (Kilkenny et al. (2009))

Important

As scientists, you - yourselves - need to understand the principles behind the statistical tests you use, in order to choose appropriate tests and methods, and to use appropriate measures to minimise animal suffering and obtain meaningful results.

You cannot simply rely on the word of “experienced scientists” for this.

The ARRIVE guidelines

The following year Kilkenny et al. (2010) proposed the ARRIVE guidelines: a checklist to help researchers report their animal research transparently and reproducibly.

  • Good reporting is essential for peer review and to inform future research
  • Reporting guidelines measurably improve reporting quality
  • Improved reporting maximises the output of published research

ARRIVE guidelines (highlights)

Many journals now routinely request information in the ARRIVE framework, often as electronic supplementary information. The framework covers 20 items including the following (Kilkenny et al. (2010)):

    1. Objectives: primary and any secondary objectives of the study, or specific hypotheses being tested
    2. Study design: brief details of the study design, including the number of experimental and control groups, any steps taken to minimise the effects of subjective bias, and the experimental unit
    3. Sample size: the total number of animals used in each experiment and the number of animals in each experimental group; how the number of animals was decided
    4. Statistical methods: details of the statistical methods used for each analysis; methods used to assess whether the data met the assumptions of the statistical approach
    5. Outcomes and estimation: results for each analysis carried out, with a measure of precision (e.g., standard error or confidence interval).

A vital step

Warning

“A key step in tackling these issues is to ensure that the next generation of scientists are aware of what makes for good practice in experimental design and animal research, and that they are not led into poor or inappropriate practices by more senior scientists without a proper grasp of these issues.”

Recommended reading

Bate and Clark (2014)

Some Statistical Concepts

Random variables

Your experimental measurements are random variables

Important

This does not mean that your measurements are entirely random numbers

Caution

Random variables are quantities whose values are subject to some element of chance, e.g. variation between individuals

  • Tail length (e.g. timing of developmental signals, distribution of nutrients)
  • Blood concentrations (e.g. circulatory heterogeneity, transient measurement differences)
  • Survival time (e.g. determining point of death)

Probability distributions

The probability distribution of a random variable \(z\) (e.g. what you measure in an experiment) describes the range of values \(z\) can take, and how likely each value is

The mean of the distribution of \(z\)

  • The mean (aka expected value or expectation) is the probability-weighted average of all possible values of \(z\)
    • Equivalently: the mean is the value that is obtained on average from a random sample from the distribution
  • Written as \(\mu_{z}\) or \(E(z)\)

The variance of a distribution of \(z\)

  • The variance of the distribution of \(z\) represents the expected mean squared difference from the mean \(\mu_z\) (or \(E(z)\)) of a random sample from the distribution.
    • \(\textrm{variance} = E((z - \mu_z)^2)\)

Understanding variance

A distribution where all values of \(z\) are the same

  • Every single value in the distribution (\(z\)) is also the mean value (\(\mu_z\)), therefore

\[z = \mu_z \implies z - \mu_z = 0 \implies (z - \mu_z)^2 = 0\] \[\textrm{variance} = E((z - \mu_z)^2) = E(0^2) = 0\]

All other distributions

In every other distribution, some values of \(z\) differ from the mean, so for at least some values of \(z\)

\[z \neq \mu_z \implies z - \mu_z \neq 0 \implies (z - \mu_z)^2 \gt 0 \] \[\implies \textrm{variance} = E((z - \mu_z)^2) \gt 0 \]

Standard deviation

Standard deviation is the square root of the variance

\[\textrm{standard deviation} = \sigma_z = \sqrt{\textrm{variance}} = \sqrt{E((z - \mu_z)^2)} \]

Advantages

  • The standard deviation (unlike variance) takes values on the same scale as the original distribution
    • Standard deviation is a more “natural-seeming” interpretation of variation

Note

We can calculate mean, variance, and standard deviation for any probability distribution.
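As a sketch (assuming Python with numpy, and an arbitrary made-up discrete distribution), the definitions above can be computed directly:

```python
import numpy as np

# An arbitrary discrete distribution: values of z and their probabilities
z = np.array([1.0, 2.0, 3.0, 4.0])
p = np.array([0.1, 0.4, 0.4, 0.1])    # probabilities sum to 1

mean = np.sum(p * z)                  # E(z)
variance = np.sum(p * (z - mean)**2)  # E((z - mu_z)^2)
sd = np.sqrt(variance)

print(mean, variance, sd)  # 2.5, 0.65, ≈ 0.806
```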

Normal Distribution 1

\[ z \sim \textrm{normal}(\mu_z, \sigma_z) \]

Note

We only need to know the mean and standard deviation to define a unique normal distribution

Tip

Measurements of variables whose value is the sum of many small, independent, additive factors may follow a normal distribution

Important

There is no reason to expect that a random variable representing direct measurements in the world will be normally distributed!
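The Tip above can be illustrated by simulation. In this minimal sketch (assuming Python with numpy; the factors and counts are made up), each simulated “measurement” is the sum of 100 small, independent, additive factors, and the resulting distribution is approximately normal:

```python
import numpy as np

rng = np.random.default_rng(42)

# Each "measurement" is the sum of 100 small, independent, additive factors
factors = rng.uniform(-0.5, 0.5, size=(10_000, 100))
z = factors.sum(axis=1)

# The sums cluster symmetrically around 0 with sd ≈ sqrt(100/12) ≈ 2.89,
# and roughly 95% of values fall within two sds of the mean
within_2sd = np.mean(np.abs(z - z.mean()) < 2 * z.std())
print(z.mean(), z.std(), within_2sd)
```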

Normal Distribution 2

Tip

  • For a normal distribution, the mean value is the value at the peak of the curve
  • The curve is symmetrical, so standard deviation describes variability equally well on both sides of the mean

(Non-)Normal Distribution 3

Tip

  • Here, the mean may not be the same value as the peak of the curve (i.e. the mode)
  • The curve is asymmetrical, so standard deviation does not describe variation equally well on either side of the mean

Binomial Distribution 1

Suppose you’re taking shots in basketball

  • how many shots?
  • how likely are you to score?
  • what is the distribution of the number of successful shots?

Tip

This kind of process generates a random variable that follows a probability distribution called a binomial distribution.

It is different from a normal distribution.

Binomial Distribution 2

\[ z \sim \textrm{binomial}(n, p) \]

Tip

  • number of shots, \(n = 20\)
  • probability of scoring, \(p = 0.3\)

\[z \sim \textrm{binomial}(20, 0.3) \]

mean and sd

\[ \textrm{mean} = n \times p \] \[ \textrm{sd} = \sqrt{n \times p \times (1-p)}\]
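A quick simulation (assuming Python with numpy) checks these formulas for the basketball example:

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 20, 0.3                        # 20 shots, 30% chance of scoring
z = rng.binomial(n, p, size=100_000)  # many repeats of the 20-shot session

print(z.mean())  # ≈ n * p = 6.0
print(z.std())   # ≈ sqrt(n * p * (1 - p)) ≈ 2.05
```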

Design note

You need to design your experiments and analyses to reflect the appropriate process/probability distributions of your data

  • E.g., does \(p\) differ between two conditions?

Poisson distribution 1

In prior experiments the frequency of calcium events in WKY was 3.8 \(\pm\) 1.1 events/field/min compared to 18.9 \(\pm\) 7.1 in SHR

This is not normal (or binomial)

Something that happens a certain number of times in a fixed interval generates a Poisson distribution.

This is different from a normal or binomial distribution.

Poisson distribution 2

\[z \sim \textrm{poisson}(\lambda)\]

Poisson distribution

\[ \textrm{mean} = \lambda \] \[ \textrm{sd} = \sqrt{\lambda} \]

Expectation (\(\lambda\))

  • Only one parameter is provided, \(\lambda\): the rate at which the measured event happens

  • Suppose a county has a population of 100,000, and the average cancer rate is 45.2 cases per million people each year

\[z \sim \textrm{poisson}(45.2 \times 100{,}000 / 1{,}000{,}000) = \textrm{poisson}(4.52) \]
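A quick simulation (assuming Python with numpy) of the county's yearly case counts checks the Poisson mean and sd formulas:

```python
import numpy as np

rng = np.random.default_rng(1)

lam = 4.52                          # expected cancer cases per year in the county
z = rng.poisson(lam, size=100_000)  # simulated yearly case counts

print(z.mean())  # ≈ lambda = 4.52
print(z.std())   # ≈ sqrt(lambda) ≈ 2.13
```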

Design note

You need to design your experiments and analyses to reflect the appropriate process/probability distributions of your data

  • E.g., does \(\lambda\) differ between two conditions?

Binomial and Poisson distributions

Some important features

  • All measured values (and \(n\)) are whole numbers greater than or equal to zero; \(\lambda\) may be any positive real number, while \(p\) lies between 0 and 1
  • The distributions may not be unimodal
  • The mean is not always the peak value (mode)
  • The distributions are not always symmetrical (so sd may not describe variation equally either side of the mean)

Distributions in Practice

Distributions are starting points

  • Distributions arise from and represent distinct generation processes (relate this to your biological system)
    • Normal distributions are generated by sums, differences, and averages
    • Poisson distributions are generated by counts (per unit interval)
    • Binomial distributions are generated by success/failure outcomes
  • Design experiments with analyses that reflect these processes

Warning

  • All statistical distributions are idealisations that ignore many features of real data
  • No real world data should be expected to exactly match any statistical distribution
  • Poisson models tend to need adjustment for overdispersion

Normal Distribution Redux

Probability mass

  • approximately 50% of the distribution lies in the range \(\mu \pm 0.68\sigma\)
  • approximately 68% of the distribution lies in the range \(\mu \pm \sigma\)
  • approximately 95% of the distribution lies in the range \(\mu \pm 2\sigma\)
  • approximately 99.7% of the distribution lies in the range \(\mu \pm 3\sigma\)
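These rules of thumb can be verified directly (assuming Python with scipy):

```python
from scipy.stats import norm

# Probability mass of a standard normal within mu ± k*sigma
for k in (0.68, 1, 2, 3):
    mass = norm.cdf(k) - norm.cdf(-k)
    print(k, round(mass, 4))  # ≈ 0.50, 0.68, 0.95, 0.997 respectively
```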

Estimates, standard errors, and confidence intervals

Parameters

Parameters are unknown numbers that determine a statistical model

A linear regression

\[ y_i = a + b x_i \]

  • Parameters are:
    • \(a\) (the intercept)
    • \(b\) (the gradient)

A normal distribution representing your data

\[ z \sim \textrm{normal}(\mu_z, \sigma) \]

  • Parameters are: \(\mu_z\) and \(\sigma\)

Estimands

An estimand (or quantity of interest) is a value that we are interested in estimating

A linear regression

\[ y_i = a + b x_i\]

  • We want to estimate values for:
    • \(a\) (the intercept)
    • \(b\) (the gradient)
    • predicted outcomes at important values of \(x_i\)

These are all estimands, and estimates are represented using the “hat” symbol: \(\hat{a}\), \(\hat{b}\), etc.

A normal distribution representing your data

\[ z \sim \textrm{normal}(\mu_z, \sigma) \]

  • Estimands are: \(\mu_z\) and \(\sigma\)
    • Maybe you want to determine the 95% confidence interval - this is also an estimand

Standard Errors and Confidence Intervals

  • The standard error is the estimated standard deviation of an estimate
    • It is a measure of our uncertainty about the quantity of interest

Note

  • Standard error gets smaller as sample size gets larger
    • You know more about the most likely value, the more data/information you collect
    • Standard error tends to zero as the sample size tends to infinity
  • The confidence interval (or CI) represents a range of values of a parameter or estimand that are roughly consistent with the data

Important

  • In repeated applications, the 50% confidence interval will include the true value 50% of the time
    • A 95% confidence interval will include the true value 95% of the time

Tip

  • The usual 95% confidence interval rule of thumb for large samples (assuming a normal distribution) is to take the estimate \(\pm\) two standard errors
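A sketch (assuming Python with numpy; the simulated population mean and sd are made up) showing the standard error of the mean shrinking with sample size, together with the \(\pm\) two standard error rule of thumb:

```python
import numpy as np

rng = np.random.default_rng(7)

ses = {}
for n in (10, 100, 1000):
    sample = rng.normal(loc=5.0, scale=2.0, size=n)  # draw a sample of size n
    estimate = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(n)             # standard error of the mean
    ses[n] = se
    lo, hi = estimate - 2 * se, estimate + 2 * se    # rule-of-thumb 95% CI
    print(f"n={n}: estimate={estimate:.2f}, se={se:.3f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

The standard error falls roughly as \(1/\sqrt{n}\): a tenfold reduction in uncertainty costs a hundredfold increase in sample size.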

Statistical significance and hypothesis testing

Statistical significance 1

  • Some scientists choose to consider a result to be “stable” or “real” if it is “statistically significant”
  • They may also consider “non-significant” results to be noisy or less reliable

Warning

I, and many other statisticians, do not recommend this approach.

However, the concept is widespread and we need to discuss it

Statistical significance 2

A common definition

  • Statistical significance is conventionally defined as attaining a \(p\)-value below some threshold (commonly 0.05), relative to a null hypothesis or prespecified value representing no effect.

  • E.g., an estimate may be considered “statistically significant at \(P < 0.05\)” if it:

    • lies at least two standard errors from the mean
    • is a difference that lies at least two standard errors from zero
  • More generally, an estimate is “not statistically significant” if, e.g.

    • the observed value can reasonably be explained by chance variation
    • it is a difference that lies less than two standard errors from zero

Most tests rely on probability distributions

  • We need to relate the measured values in the real world to an appropriate distribution that approximates them

A simple example: The experiment

The experiment

  • Two drugs, \(C\) and \(T\), lower cholesterol, and we want to compare their effectiveness
  • We randomise assignment of \(C\) and \(T\) to members of a single cohort of comparable individuals, whose pre-treatment cholesterol level is assumed to be drawn from the same distribution (i.e. be approximately the same)
  • We measure the post-treatment cholesterol levels \(y_T\) and \(y_C\) for each individual in the two groups.
  • We calculate the average measured \(\bar{y}_T\) and \(\bar{y}_C\) for the treatment and control groups as estimates for the true post-treatment levels \(\theta_T\) and \(\theta_C\).
    • We also calculate standard deviation for the two groups, \(\sigma_T\) and \(\sigma_C\)

A simple example: The hypotheses

  • We want to know if the treatments have different sizes of effect
    • If they do, there should be a difference between the (average) post-treatment cholesterol level in each group
    • The true post-treatment levels are \(\theta_T\) and \(\theta_C\)
    • We have estimated means, \(\bar{y}_T\) and \(\bar{y}_C\) for post-treatment levels

The hypotheses

  • We are interested in \(\theta = \theta_T - \theta_C\), the expected post-test difference in cholesterol between the two groups \(T\) and \(C\).
  • Our null hypothesis (\(H_0\)) is that \(\theta = 0\), i.e. there is no difference (\(\theta_C = \theta_T\))
  • Our alternative hypothesis (\(H_1\)) is that there is a difference, so \(\theta \neq 0\), (i.e. \(\theta_C \neq \theta_T\))

A simple example: The distribution 1

  • To perform a statistical test, we may assume a distribution and parameters for the null hypothesis
    • We can then test the observed estimate against that distribution to see how likely it is that the null hypothesis would have generated it

The distribution

  • We use a probability distribution reflecting how data would be generated under the null hypothesis: \(\theta_C = \theta_T\)
    • This allows us to define a test statistic \(T\) and a threshold for “significance” in advance
  • We test the estimated value from the experiment (\(\bar{y}_T - \bar{y}_C\)) to calculate a \(p\)-value for our estimate: \(p = \textrm{Pr}(T(y^{\textrm{null}}) \geq T(\bar{y}_T - \bar{y}_C))\)

A simple example: The null hypothesis

The null hypothesis

  • Assume that the true difference \(\theta\) is normally-distributed with \(\mu_\theta=0\), \(\sigma_\theta=1\)

A simple example: The estimated difference

Observed difference between post-treatment levels: \(\bar{y}_T - \bar{y}_C = -1.4\)

  • Is this an unlikely outcome given the null hypothesis?

A simple example: A significance threshold

We choose a significance threshold in advance

  • Suppose we set a threshold \(T\) corresponding to the 90% confidence interval (i.e. \(P<0.1\))
    • If the estimate is not in the central 90% of the distribution, we’ll say it’s “significant”

A simple example: Compare the estimate

Compare the estimate to the threshold

  • The estimate lies outwith the threshold, so we call the difference “significant”

A simple example: Another threshold

We choose a significance threshold in advance

  • Suppose we set the threshold \(T\) corresponding to the 95% confidence interval (i.e. \(P<0.05\)) instead?

A simple example: Another outcome

Compare the estimate to the threshold

  • The estimate lies within the threshold, so the difference is “not significant”

A simple example: What changed?

What did not change

  • The null hypothesis was the same
  • The observed estimate of difference was the same

What changed

  • Our choice of significance threshold changed

Significance threshold choice

  • Once the estimate is known, it is always possible to find a threshold that makes it “significant” or “not significant”
  • It is dishonest to select a threshold deliberately to make your result “significant” or “not significant”
  • Always choose and record (preregister) your threshold for significance ahead of the experiment

Tailed tests: two-tailed

Use two tails if direction of change doesn’t matter

  • With a two-tailed hypothesis test, we do not care which direction of change is significant

Tailed tests: one-tailed (left)

Use one-tailed tests when direction matters

  • If we’re testing specifically for a significant negative difference/reduction, use a left-tailed test
  • e.g. if we wanted to know if \(T\) reduced post-test levels with respect to \(C\) at a threshold of \(P < 0.05\)

Tailed tests: one-tailed (right)

Use one-tailed tests when direction matters

  • If we’re testing specifically for a positive difference/increase, use a right-tailed test
  • e.g. if we wanted to know if \(T\) increased post-test levels with respect to \(C\) at a threshold of \(P < 0.05\)
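A minimal sketch (assuming Python with scipy) of how the choice of tail changes the \(p\)-value, using the earlier estimated difference of \(-1.4\) under the normal(0, 1) null:

```python
from scipy.stats import norm

# Observed standardised difference from the cholesterol example
z = -1.4  # (y_T - y_C) under a normal(0, 1) null

p_left  = norm.cdf(z)            # one-tailed (left): test for a reduction
p_right = 1 - norm.cdf(z)        # one-tailed (right): test for an increase
p_two   = 2 * norm.cdf(-abs(z))  # two-tailed: either direction

print(round(p_left, 3), round(p_right, 3), round(p_two, 3))  # 0.081 0.919 0.162
```

The same estimate yields very different \(p\)-values depending on the tail, which is one more reason the test (and its direction) must be fixed and preregistered before the data are seen.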

Problems with statistical significance 1

Warning

It is a common error to summarise comparisons by statistical significance into “significant” and “non-significant” results

Statistical significance is not the same as practical importance

  • Suppose a treatment increased earnings by £10 per year with a standard error of £2 (average salary £25,000).
    • This would be statistically, but not practically, significant
  • Suppose a different treatment increased earnings by £10,000 per year with a standard error of £10,000
    • This would not be statistically significant, but could be important in practice
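The two earnings scenarios above can be made concrete (assuming Python with scipy, a normal approximation, and a null of zero effect; the helper function is illustrative):

```python
from scipy.stats import norm

def two_sided_p(estimate, se):
    """Two-sided p-value for an estimate under a normal null at zero."""
    return 2 * norm.cdf(-abs(estimate / se))

# £10/year gain, £2 standard error: tiny effect, but highly "significant"
print(two_sided_p(10, 2))           # ≈ 0.0000006
# £10,000/year gain, £10,000 standard error: large effect, "not significant"
print(two_sided_p(10_000, 10_000))  # ≈ 0.32
```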

Problems with statistical significance 2

Warning

It is a common error to summarise comparisons by statistical significance into “significant” and “non-significant” results

Non-significance is not the same as zero

  • Suppose an arterial stent treatment group outperforms the control
    • mean difference in treadmill time: 16.6s (standard error 9.8)
    • the 95% confidence interval for the effect includes zero, \(p \approx 0.20\)
  • It’s not clear whether the net treatment effect is positive or negative
    • but we can’t say that stents have no effect

Problems with statistical significance 3

The difference between ‘significant’ and ‘not significant’ is not statistically significant

  1. At a \(P<0.05\) threshold, only a small change in the data is required to move from \(P = 0.051\) to \(P = 0.049\)
  2. Large changes in significance can correspond to non-significant differences in the underlying variables


Experimental design and sample size decisions

Sampling and variance

We defined variance earlier for a distribution of random variable \(z\) as:

\[ \textrm{variance} = E((z - \mu_z)^2) \]

But this was for an infinite number of measurements of \(z\)

Important

We cannot make an infinite number of measurements of \(z\). We can only take a sample.

The variance we estimate in an experiment will not match that of the infinitely large population.

(Unbiased) sample variance

We do not know the true population-level variance

So we calculate the unbiased sample variance with a correction for the sample size:

\[ \textrm{variance} = \frac{\sum^{n}_{i=1} (z_i - \bar{z})^2}{n - 1} \implies \textrm{standard deviation} = \sqrt{\frac{\sum^{n}_{i=1} (z_i - \bar{z})^2}{n - 1}} \]

where \(\bar{z}\) is the sample mean. We must use \(\bar{z}\) because \(\mu_z\) is unknown; this is why the \(n-1\) correction is needed.
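In practice (assuming Python with numpy; the data values are made up), the \(n-1\) correction corresponds to numpy's `ddof=1` option:

```python
import numpy as np

z = np.array([4.1, 5.3, 3.8, 4.9, 5.0])  # a small made-up sample

zbar = z.mean()                                    # sample mean
var_unbiased = ((z - zbar)**2).sum() / (len(z) - 1)

# numpy applies the same n-1 correction with ddof=1
assert np.isclose(var_unbiased, z.var(ddof=1))
print(var_unbiased, np.sqrt(var_unbiased))  # ≈ 0.407, ≈ 0.638
```

Note that `np.var` and `np.std` default to `ddof=0` (dividing by \(n\)), which gives the biased population formula, so the `ddof=1` argument matters for sample data.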

References

References

Bate, Simon T., and Robin A. Clark. 2014. The Design and Statistical Analysis of Animal Experiments. Cambridge University Press.
Kilkenny, Carol, William J Browne, Innes C Cuthill, Michael Emerson, and Douglas G Altman. 2010. “Improving Bioscience Research Reporting: The ARRIVE Guidelines for Reporting Animal Research.” PLoS Biol. 8 (6): e1000412.
Kilkenny, Carol, Nick Parsons, Ed Kadyszewski, Michael F W Festing, Innes C Cuthill, Derek Fry, Jane Hutton, and Douglas G Altman. 2009. “Survey of the Quality of Experimental Design, Statistical Analysis and Reporting of Research Using Animals.” PLoS One 4 (11): e7824.